Gaudi: add CI #3160


Draft: wants to merge 16 commits into main

Conversation

@baptistecolle (Collaborator) commented Apr 10, 2025

What does this PR do?

This PR adds CI support for the Gaudi backend. It includes an integration test that starts the model "meta-llama/Llama-3.1-8B-Instruct", performs a few requests, and verifies that the outputs match the expected results.

Additional models are also supported, but running tests for all of them is quite slow, so they are not included in the CI by default. However, instructions on how to run the integration tests for all supported models have been added to the Gaudi backend README.
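The test flow described above (start the server, send a few requests, compare against reference outputs) can be sketched roughly as follows. This is an illustrative sketch, not the actual test code from this PR: the helper names and expected-output handling are assumptions, while the `/generate` payload shape follows TGI's HTTP API with greedy decoding so outputs are deterministic.

```python
# Illustrative sketch of the integration-test flow: build a greedy
# /generate request and compare the response against a reference output.
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str, max_new_tokens: int = 32) -> urllib.request.Request:
    """Build a /generate request with sampling disabled for determinism."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def outputs_match(response_body: bytes, expected: str) -> bool:
    """Compare the generated text against the expected reference output."""
    return json.loads(response_body)["generated_text"] == expected


# Example usage against a running server (e.g. one launched with
# --model-id meta-llama/Llama-3.1-8B-Instruct):
#   with urllib.request.urlopen(build_generate_request("http://localhost:8080", prompt)) as r:
#       assert outputs_match(r.read(), expected_text)
```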

@baptistecolle (Collaborator, Author) commented Apr 22, 2025

I’ll wait for the Gaudi integration test CI to pass before merging anything:
https://github.com/huggingface/text-generation-inference/actions/runs/14591230970/job/40927197928?pr=3160

The previous run was green, which gives me confidence in the current changes:
https://github.com/huggingface/text-generation-inference/actions/runs/14384130453/job/40336095297

Unfortunately, it can take days to get assigned a Gaudi1 runner 😭, so I figured I could start iterating on your reviews in the meantime rather than waiting for the CI to finish before requesting feedback. In any case, I will only merge once the Gaudi integration test passes in CI.

@baptistecolle baptistecolle marked this pull request as ready for review April 22, 2025 10:01
@regisss (Collaborator) left a comment

LGTM!

We should soon have access to Gaudi2 and Gaudi3 ephemeral runners on demand, which will make things much easier than waiting for a DL1 instance. I suggest we wait for those to be available before updating and merging this PR.

@baptistecolle (Collaborator, Author) commented
OK, I will wait for the new runners before adding Gaudi to the CI, as the DL1 runners are indeed very unreliable.

@baptistecolle baptistecolle marked this pull request as draft April 23, 2025 07:42
@Narsil (Collaborator) previously approved these changes Apr 23, 2025

LGTM

@baptistecolle (Collaborator, Author) commented
The runners for Gaudi are ready! 🙌 Thanks @regisss

Just requesting new reviews to make sure everything is still okay. Since the last review, I rebased on main and switched to the new runners. The integration tests are now passing and the runners are super fast! https://github.com/huggingface/text-generation-inference/actions/runs/15160963395/job/42627380206?pr=3160

@baptistecolle baptistecolle marked this pull request as ready for review May 21, 2025 11:49
@@ -129,9 +129,9 @@ jobs:
         export label_extension="-gaudi"
         export docker_volume="/mnt/cache"
         export docker_devices=""
-        export runs_on="ubuntu-latest"
+        export runs_on="itac-bm-emr-gaudi3-dell-1gaudi"
Review comment from a collaborator:
All tests are going to pass with 1 device only? Big (i.e. 70B+ parameters) models are not tested?

@baptistecolle (Collaborator, Author) replied May 22, 2025
Indeed, I disabled the big models and only kept a small model for faster iteration. I just re-enabled a multi-card test and it is broken 😬. There seems to be a regression between the original PR and the latest TGI backend, so I am looking into it 👀. The error also differs depending on the hardware (Gaudi 1 vs Gaudi 3) 😣

@regisss (Collaborator) commented May 22, 2025

@baptistecolle A couple of questions:

  • It's not possible to select a specific runner for each test config right?
  • If I want to add a new model to test, I just need to add a new test config in test_gaudi_generate.py?

@baptistecolle (Collaborator, Author) commented May 22, 2025

> @baptistecolle A couple of questions:
>
>   • It's not possible to select a specific runner for each test config right?
>   • If I want to add a new model to test, I just need to add a new test config in test_gaudi_generate.py?

  1. No, it is not. I think this would require some rework of the build workflow, which is shared across all the hardware targets. The best alternative would be to use a runner with 8 cards and then set HABANA_VISIBLE_DEVICES=1.
  2. Yes, that's correct.
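The single-card setup mentioned in the first answer could look roughly like this. It is a sketch under stated assumptions: the image tag, port mapping, and device index are placeholders, and the actual launch command used by the CI may differ.

```shell
# Restrict the visible Gaudi cards so several test configs can share one
# multi-card runner (device index 0 is an illustrative choice).
export HABANA_VISIBLE_DEVICES=0

# Hypothetical launch command (image tag and ports are placeholders):
#   docker run --runtime=habana -e HABANA_VISIBLE_DEVICES \
#     -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest-gaudi \
#     --model-id meta-llama/Llama-3.1-8B-Instruct

echo "HABANA_VISIBLE_DEVICES=$HABANA_VISIBLE_DEVICES"
```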

One additional useful remark: you also need to add the new config with "run_by_default": True for it to run in the CI. Since there are a lot of tests, I only run a subset of them in the CI for faster testing, rather than every model we support.
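A config entry along these lines could express the subset selection described above. Only the "run_by_default" key comes from this discussion; the dict layout, the other field names, and the helper function are illustrative assumptions, not the actual contents of test_gaudi_generate.py.

```python
# Hypothetical sketch of registering a model test config and selecting
# the subset that runs in CI. Only "run_by_default" comes from the PR
# discussion; everything else here is an assumed layout.
TEST_CONFIGS = {
    "llama3-8b-1card": {
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "num_cards": 1,
        "run_by_default": True,  # included in the default CI subset
    },
}


def configs_to_run(configs: dict, run_all: bool = False) -> list[str]:
    """Return the config names to execute: everything when run_all is set
    (e.g. for a full local run), otherwise only the default CI subset."""
    return [
        name
        for name, cfg in configs.items()
        if run_all or cfg.get("run_by_default", False)
    ]
```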

@baptistecolle baptistecolle marked this pull request as draft May 22, 2025 07:24
4 participants